Topic: Unwrapping the Gift: Data Insights
An Explanatory Data Analysis for a UK-Based Online Gift Retailer to Improve Sales and Gain Customer Insights
Introduction:
Dataset available on the UC Irvine Machine Learning Repository website.
Dataset Url: (https://archive.ics.uci.edu/dataset/502/online+retail+ii)
Business Context
My client is a UK-based, non-store e-commerce retailer specialising in unique, all-occasion gifts sold through digital channels, mainly serving wholesale customers. The company offers a broad collection of seasonal gift products for different events and wants help with:
- Optimizing their inventory,
- Aligning advertisements and promotions with events and seasonal trends,
- Adjusting pricing strategies for different regions and countries,
- Boosting overall sales performance.
Audience Overview
My main task is to provide data-based insights that support strategic decision-making across the business operations. Hence, my primary audience will be:
- Founder(s) (Business Owner(s))
- Leadership Team
- Marketing Team
- Sales Team
- Supply Chain and Operational Managers
- Customer Service Team
What Should My Audience Know and Do?
My goal is to help the audience understand data-driven insights: key patterns, trends, opportunities, and areas for improvement. The main motive is to empower them to take strategic action and tackle obstacles.
Dataset Overview
This analysis uses the company's past sales and transactional data. The full repository dataset spans December 2009 to December 2011; the workbook sheet analysed here covers December 2009 to December 2010. It records the company's transactional activity, with information such as invoice number, product details, transaction details, and customer details.

| Data | Type of Data |
|---|---|
| Invoice | Categorical/Text |
| StockCode | Categorical |
| Product | Text |
| Ordered_Quantity | Numerical (Integer) |
| InvoiceDate | Date/Time |
| Price | Numerical (Float) |
| Customer ID | Categorical/Integer |
| Country | Categorical/Text |
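The intended types above could be enforced right after loading. A minimal sketch on a toy frame (the values are illustrative; the real load uses pnds.read_excel on the workbook):

```python
import pandas as pd

# Toy rows mimicking the raw import (illustrative values only)
raw = pd.DataFrame({
    "Invoice": ["489434", "C489449"],
    "StockCode": ["85048", "22087"],
    "Quantity": [12, -12],
    "Price": ["6.95", "2.95"],           # sometimes read as text
    "Customer ID": [13085.0, 16321.0],   # Excel import often yields floats
})

# Coerce columns toward the intended types from the data dictionary
raw["Quantity"] = raw["Quantity"].astype("int64")
raw["Price"] = pd.to_numeric(raw["Price"])
raw["Customer ID"] = raw["Customer ID"].astype("Int64")  # nullable integer
print(raw.dtypes)
```

The nullable `Int64` dtype keeps Customer ID as an integer even when some values are missing, which plain `int64` cannot do.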
Data Exploration
Initial Setup: Importing the required libraries and the data
# Library Imports for Explanatory Analysis
import pandas as pnds
import numpy as nmpy
import plotly.express as pltyx
import plotly.graph_objects as pltygo
import warnings
warnings.filterwarnings("ignore")
import textwrap
from textwrap import wrap
pnds.set_option("display.max_columns", 15)
Reading the Dataset using pandas
# Reading the Excel file with the .read_excel() pandas function
uci_original_dataset = pnds.read_excel("../../Main/M512_Data_Visualisation_and_Communication/online_retail_II 2.xlsx")
#Copying the original dataset
working_dataset = uci_original_dataset.copy()
Dataset Overview
#Checking the shape of the dataset
working_dataset.shape
(525461, 8)
The dataset originally contains 525,461 rows of transactional data with 8 columns of respective information.
#Listing the columns and their data types in a frame
dataset_column_info = working_dataset.dtypes
dataset_column_info = dataset_column_info.to_frame()
dataset_column_info.columns = ["dtypes"]
dataset_column_info
| dtypes | |
|---|---|
| Invoice | object |
| StockCode | object |
| Description | object |
| Quantity | int64 |
| InvoiceDate | datetime64[ns] |
| Price | float64 |
| Customer ID | float64 |
| Country | object |
#Checking the last 5 rows of the dataset to inspect the data values
working_dataset.tail(5)
| Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | |
|---|---|---|---|---|---|---|---|---|
| 525456 | 538171 | 22271 | FELTCRAFT DOLL ROSIE | 2 | 2010-12-09 20:01:00 | 2.95 | 17530.0 | United Kingdom |
| 525457 | 538171 | 22750 | FELTCRAFT PRINCESS LOLA DOLL | 1 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525458 | 538171 | 22751 | FELTCRAFT PRINCESS OLIVIA DOLL | 1 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525459 | 538171 | 20970 | PINK FLORAL FELTCRAFT SHOULDER BAG | 2 | 2010-12-09 20:01:00 | 3.75 | 17530.0 | United Kingdom |
| 525460 | 538171 | 21931 | JUMBO STORAGE BAG SUKI | 2 | 2010-12-09 20:01:00 | 1.95 | 17530.0 | United Kingdom |
From the above output, I can observe that pandas has interpreted some data types incorrectly; for example, Customer ID is float64 even though it is an identifier. The dataset also needs some preprocessing, such as renaming the column headers for readability and separating the date and time components.
#Renaming the columns header name of working dataset
working_dataset.columns = ["Invoice_Number",
"Product_Code",
"Product",
"Ordered_Quantity",
"Invoice_Date",
"Product_Price_in_Sterling",
"Customer_ID_Number",
"Customer_Region"]
Before changing the data types, I need to check for null values and placeholder entries such as "?" or empty values in each column to avoid complications.
#Checking the *?* values
data_have_question_mark = (working_dataset == "?").sum().sum()
print(data_have_question_mark)
45
#Replacing ? with NaN
working_dataset = working_dataset.replace("?",nmpy.nan)
data_have_question_mark = (working_dataset == "?").sum().sum()
print(data_have_question_mark)
0
#Checking the missing values
null_values_info = working_dataset.isna().sum().to_frame(name="miss values")
null_values_info["% of miss values"] = (((working_dataset.isna().sum()) / len(working_dataset)) * 100).round(2)
null_values_info
| miss values | % of miss values | |
|---|---|---|
| Invoice_Number | 0 | 0.00 |
| Product_Code | 0 | 0.00 |
| Product | 2973 | 0.57 |
| Ordered_Quantity | 0 | 0.00 |
| Invoice_Date | 0 | 0.00 |
| Product_Price_in_Sterling | 0 | 0.00 |
| Customer_ID_Number | 107927 | 20.54 |
| Customer_Region | 0 | 0.00 |
The dataset has missing values in the Customer ID column (20.5%) and the Product column (0.6%). I need to handle these missing values before proceeding further.
Dataset Cleaning
Before that, I also need to confirm there are no duplicate rows; if any are present, I will drop them.
#Checking for duplicate data
print(f"The Dataset Contains {working_dataset.duplicated(keep='first').sum()} Duplicate Rows")
The Dataset Contains 6865 Duplicate Rows
It is not always good to drop data without proper justification, so I verified and compared the values of the duplicated rows column by column to understand whether the duplication came from a system-level error or a human error.
#Verification of duplicate data
duplicates = working_dataset[working_dataset.duplicated(keep = False)]
duplicates = duplicates.sort_values(by=["Invoice_Number",
"Product_Code",
"Product",
"Ordered_Quantity",
"Invoice_Date",
"Product_Price_in_Sterling",
"Customer_ID_Number",
"Customer_Region"])
duplicates.head(4)
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Invoice_Date | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | |
|---|---|---|---|---|---|---|---|---|
| 379 | 489517 | 21491 | SET OF THREE VINTAGE GIFT WRAPS | 1 | 2009-12-01 11:34:00 | 1.95 | 16329.0 | United Kingdom |
| 391 | 489517 | 21491 | SET OF THREE VINTAGE GIFT WRAPS | 1 | 2009-12-01 11:34:00 | 1.95 | 16329.0 | United Kingdom |
| 365 | 489517 | 21821 | GLITTER STAR GARLAND WITH BELLS | 1 | 2009-12-01 11:34:00 | 3.75 | 16329.0 | United Kingdom |
| 386 | 489517 | 21821 | GLITTER STAR GARLAND WITH BELLS | 1 | 2009-12-01 11:34:00 | 3.75 | 16329.0 | United Kingdom |
After thorough inspection, I can confirm these duplicate rows do not provide any additional information, so I will drop them.
Observation: Possible cause for duplication
This transactional data duplication seems to stem from a technical, system-level error. Hence, it is recommended to assign a unique transaction ID, with deduplication checks, to each transaction.
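As a sketch of that recommendation (the helper name and the choice of key fields are my own, not part of the company's systems), a composite key over the identifying fields can flag duplicates at ingest:

```python
import hashlib
import pandas as pd

def transaction_key(row) -> str:
    # Hypothetical helper: hash the fields that should uniquely identify a line item
    raw = "|".join(str(row[c]) for c in
                   ["Invoice_Number", "Product_Code", "Ordered_Quantity",
                    "Invoice_Date", "Customer_ID_Number"])
    return hashlib.sha1(raw.encode()).hexdigest()

# Toy rows: the first two are exact duplicates
sample = pd.DataFrame({
    "Invoice_Number": ["489517", "489517", "489518"],
    "Product_Code": ["21491", "21491", "21821"],
    "Ordered_Quantity": [1, 1, 1],
    "Invoice_Date": ["2009-12-01 11:34", "2009-12-01 11:34", "2009-12-01 11:40"],
    "Customer_ID_Number": [16329, 16329, 16329],
})
sample["txn_key"] = sample.apply(transaction_key, axis=1)
duplicate_count = sample["txn_key"].duplicated().sum()
print(duplicate_count)  # 1
```

A source system could refuse or flag an insert whose key already exists, preventing this class of duplication upstream.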
working_dataset = working_dataset.drop_duplicates(keep="first").reset_index(drop=True)
After careful observation: the 20.5% of rows with a missing Customer_ID cannot be guest checkouts, since the transactions are online and the business operates on a B2B model. They are more likely an error introduced during data import, so I am going to drop the rows with a missing Customer ID. Additionally, I will drop the rows with missing values in the Product column.
#Dropping null values from the Customer_ID column
working_dataset = working_dataset.dropna(subset=["Customer_ID_Number"], axis=0)
#Dropping null values from the Product column
working_dataset = working_dataset.dropna(subset=["Product"], axis=0)
working_dataset = working_dataset.reset_index(drop=True)
working_dataset.isna().sum().to_frame(name="miss values")
| miss values | |
|---|---|
| Invoice_Number | 0 |
| Product_Code | 0 |
| Product | 0 |
| Ordered_Quantity | 0 |
| Invoice_Date | 0 |
| Product_Price_in_Sterling | 0 |
| Customer_ID_Number | 0 |
| Customer_Region | 0 |
I have dealt with the missing and null values. Now I will change the data type of the Customer_ID column, since it holds integer values, and extract the transaction date and time into separate columns.
#Changing Customer_ID Column Datatype to int
working_dataset["Customer_ID_Number"]=working_dataset["Customer_ID_Number"].astype("int")
working_dataset["Date"] = working_dataset["Invoice_Date"].dt.date
working_dataset["Day"] = working_dataset["Invoice_Date"].dt.day
#working_dataset["Month"] = working_dataset["Invoice_Date"].dt.month
working_dataset["Month"] = working_dataset["Invoice_Date"].dt.strftime("%b").str.upper()
working_dataset["Year"] = working_dataset["Invoice_Date"].dt.year
working_dataset["Time"] = working_dataset["Invoice_Date"].dt.time
working_dataset = working_dataset.drop(columns=["Invoice_Date"])
working_dataset.sample(5)
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 241701 | 513095 | 22569 | FELTCRAFT CUSHION BUTTERFLY | 4 | 3.75 | 15122 | United Kingdom | 2010-06-21 | 21 | JUN | 2010 | 14:49:00 |
| 81125 | 496921 | 21531 | RETRO SPOT SUGAR JAM BOWL | 1 | 2.55 | 14081 | United Kingdom | 2010-02-04 | 4 | FEB | 2010 | 15:49:00 |
| 464728 | 533750 | 20682 | RED RETROSPOT CHILDRENS UMBRELLA | 2 | 3.25 | 15571 | United Kingdom | 2010-11-18 | 18 | NOV | 2010 | 14:17:00 |
| 7015 | 490011 | 21173 | LOO ROLL METAL SIGN | 3 | 1.65 | 16918 | United Kingdom | 2009-12-03 | 3 | DEC | 2009 | 12:27:00 |
| 155981 | 504436 | 21257 | VICTORIAN SEWING BOX MEDIUM | 48 | 6.95 | 14156 | EIRE | 2010-04-13 | 13 | APR | 2010 | 13:33:00 |
Descriptive Statistics
working_dataset[["Ordered_Quantity","Product_Price_in_Sterling"]].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Ordered_Quantity | 410763.0 | 12.923735 | 102.039550 | -9360.0 | 2.00 | 5.00 | 12.00 | 19152.00 |
| Product_Price_in_Sterling | 410763.0 | 3.908358 | 71.714794 | 0.0 | 1.25 | 1.95 | 3.75 | 25111.09 |
As the minimum Ordered_Quantity is negative and the minimum price is zero, these rows could be outliers or cancelled/returned transactions. Let me investigate them.
negative_price_data = working_dataset[working_dataset["Product_Price_in_Sterling"]<0]
negative_price_data
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 12 columns
Observation: Negative Price
The filter returns no rows, so no negative prices remain at this stage. In the raw data, negative prices typically indicate an adjustment of bad debt, which can arise for many reasons: a customer refusing to pay, defective products, delivery problems, financial adjustments, etc. Those rows appear to have already been removed along with the missing Customer IDs.
negative_quantity_data = working_dataset[working_dataset["Ordered_Quantity"]<0]
negative_quantity_data
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 178 | C489449 | 22087 | PAPER BUNTING WHITE LACE | -12 | 2.95 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 179 | C489449 | 85206A | CREAM FELT EASTER EGG BASKET | -6 | 1.65 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 180 | C489449 | 21895 | POTTING SHED SOW 'N' GROW SET | -4 | 4.25 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 181 | C489449 | 21896 | POTTING SHED TWINE | -6 | 2.10 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 182 | C489449 | 22083 | PAPER CHAIN KIT RETRO SPOT | -12 | 2.95 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517833 | C538123 | 22956 | 36 FOIL HEART CAKE CASES | -2 | 2.10 | 12605 | Germany | 2010-12-09 | 9 | DEC | 2010 | 15:41:00 |
| 517834 | C538124 | M | Manual | -4 | 0.50 | 15329 | United Kingdom | 2010-12-09 | 9 | DEC | 2010 | 15:43:00 |
| 517835 | C538124 | 22699 | ROSES REGENCY TEACUP AND SAUCER | -1 | 2.95 | 15329 | United Kingdom | 2010-12-09 | 9 | DEC | 2010 | 15:43:00 |
| 517836 | C538124 | 22423 | REGENCY CAKESTAND 3 TIER | -1 | 12.75 | 15329 | United Kingdom | 2010-12-09 | 9 | DEC | 2010 | 15:43:00 |
| 518419 | C538164 | 35004B | SET OF 3 BLACK FLYING DUCKS | -1 | 1.95 | 14031 | United Kingdom | 2010-12-09 | 9 | DEC | 2010 | 17:32:00 |
9816 rows × 12 columns
Observation: Negative Ordered_Quantity
These 9,816 rows appear to be product returns or non-standard back-end inventory transactions, caused by damage, checks, loss, mixing, shortages, or sales through other platforms.
I tried to change the data type of Invoice_Number to integer but could not, which indicated that some values are not integers, so I will investigate those.
working_dataset["Invoice_Number"].value_counts()
Invoice_Number
500356 251
507235 250
526089 240
511522 240
511051 236
...
C530282 1
C530279 1
517214 1
517215 1
C503745 1
Name: count, Length: 23587, dtype: int64
non_integer_invoice_number = working_dataset[working_dataset["Invoice_Number"].str.startswith("C", na=False)]
print(f"The Dataset Contains {len(non_integer_invoice_number)} rows of non integer invoice number")
non_integer_invoice_number.head(4)
The Dataset Contains 9816 rows of non integer invoice number
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 178 | C489449 | 22087 | PAPER BUNTING WHITE LACE | -12 | 2.95 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 179 | C489449 | 85206A | CREAM FELT EASTER EGG BASKET | -6 | 1.65 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 180 | C489449 | 21895 | POTTING SHED SOW 'N' GROW SET | -4 | 4.25 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
| 181 | C489449 | 21896 | POTTING SHED TWINE | -6 | 2.10 | 16321 | Australia | 2009-12-01 | 1 | DEC | 2009 | 10:33:00 |
Observation: Non-integer invoice numbers
There are 9,816 transactions whose invoice number starts with the letter "C" and whose Ordered_Quantity is negative, indicating that these are cancelled orders.
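To put that in context, the share of cancelled rows can be computed directly; a toy sketch (in the notebook the frame would be working_dataset):

```python
import pandas as pd

# Toy invoice numbers; a leading "C" marks a cancellation, as in the source data
sample = pd.DataFrame({"Invoice_Number": ["489434", "C489449", "489435", "C489450"]})

is_cancelled = sample["Invoice_Number"].astype(str).str.startswith("C")
cancel_rate = is_cancelled.mean() * 100
print(f"{cancel_rate:.1f}% of rows are cancellations")  # 50.0%
```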
All of these outlier or invalid rows can be dropped, so I am dropping them.
working_dataset = working_dataset.drop(working_dataset[working_dataset["Ordered_Quantity"] < 0].index)
working_dataset = working_dataset.drop(working_dataset[working_dataset["Product_Price_in_Sterling"] <= 0].index)
working_dataset = working_dataset.drop(working_dataset[working_dataset["Invoice_Number"].astype(str).str.startswith("C")].index)
working_dataset["Invoice_Number"] = working_dataset["Invoice_Number"].astype("int")
I have also noticed that some product codes look anomalous (non-stock item codes or internal accounting codes) and deviate from the normal product-code format. I will inspect and remove such values.
product_code_values = working_dataset["Product_Code"].value_counts()
#Function to flag product codes that follow the normal format
def product_code_filter(value):
    product_code = str(value).strip()
    if not product_code:
        return False
    # Valid stock codes contain exactly five numeric digits (a trailing letter is allowed)
    numeric_digit_count = sum(ch.isdigit() for ch in product_code)
    return numeric_digit_count == 5
anomalies_product_code_values = working_dataset[~working_dataset["Product_Code"].map(product_code_filter)]
anomal_product_code_data = pnds.DataFrame(
anomalies_product_code_values["Product_Code"].unique(), columns=["Anomalies"]
)
anomal_product_code_data
| Anomalies | |
|---|---|
| 0 | POST |
| 1 | C2 |
| 2 | M |
| 3 | BANK CHARGES |
| 4 | TEST001 |
| 5 | TEST002 |
| 6 | PADS |
| 7 | ADJUST |
| 8 | D |
| 9 | ADJUST2 |
| 10 | SP1002 |
#Removing transactions with anomalous product codes
working_dataset = working_dataset[~working_dataset.index.isin(anomalies_product_code_values.index)]
working_dataset = working_dataset.reset_index(drop=True)
working_dataset.sample(5)
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 256401 | 522465 | 21094 | SET/6 RED SPOTTY PAPER PLATES | 12 | 0.85 | 16155 | United Kingdom | 2010-09-15 | 15 | SEP | 2010 | 08:43:00 |
| 191343 | 514161 | 20724 | RED SPOTTY CHARLOTTE BAG | 10 | 0.85 | 13875 | United Kingdom | 2010-06-30 | 30 | JUN | 2010 | 15:18:00 |
| 266055 | 523538 | 84077 | WORLD WAR 2 GLIDERS ASSTD DESIGNS | 96 | 0.29 | 16341 | United Kingdom | 2010-09-22 | 22 | SEP | 2010 | 14:01:00 |
| 371634 | 534765 | 21187 | WHITE BELL HONEYCOMB PAPER GARLAND | 3 | 1.65 | 15555 | United Kingdom | 2010-11-24 | 24 | NOV | 2010 | 10:25:00 |
| 88330 | 501291 | 84656 | WHITE ROSE C/COVER | 1 | 5.95 | 14343 | United Kingdom | 2010-03-15 | 15 | MAR | 2010 | 15:33:00 |
Now I have successfully dropped all outlier and unusual rows.
working_dataset[["Ordered_Quantity","Product_Price_in_Sterling"]].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Ordered_Quantity | 399552.0 | 13.801350 | 97.801534 | 1.00 | 2.00 | 5.00 | 12.00 | 19152.0 |
| Product_Price_in_Sterling | 399552.0 | 2.998056 | 4.310102 | 0.03 | 1.25 | 1.95 | 3.75 | 295.0 |
working_dataset[["Customer_ID_Number","Customer_Region","Product_Code"]].nunique().to_frame(name="unique values")
| unique values | |
|---|---|
| Customer_ID_Number | 4285 |
| Customer_Region | 37 |
| Product_Code | 4006 |
From the above output, I can see that the company has 4,285 customers and transactions from 37 countries; since it mainly operates on a B2B model, this offers good scope for customer segmentation.
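A natural follow-up for segmentation is counting unique customers per region; a toy sketch of the groupby (in the notebook the frame would be working_dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "Customer_ID_Number": [1, 2, 2, 3, 4],
    "Customer_Region": ["United Kingdom", "United Kingdom", "United Kingdom",
                        "EIRE", "Germany"],
})
# Unique customers per region, largest first
customers_per_region = (
    sample.groupby("Customer_Region")["Customer_ID_Number"]
          .nunique()
          .sort_values(ascending=False)
)
print(customers_per_region)
```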
The data is cleaned and ready for further analysis.
#Cleaned dataset
cleaned_transactional_data = working_dataset.copy()
cleaned_transactional_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399552 entries, 0 to 399551
Data columns (total 12 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   Invoice_Number             399552 non-null  int64
 1   Product_Code               399552 non-null  object
 2   Product                    399552 non-null  object
 3   Ordered_Quantity           399552 non-null  int64
 4   Product_Price_in_Sterling  399552 non-null  float64
 5   Customer_ID_Number         399552 non-null  int64
 6   Customer_Region            399552 non-null  object
 7   Date                       399552 non-null  object
 8   Day                        399552 non-null  int32
 9   Month                      399552 non-null  object
 10  Year                       399552 non-null  int32
 11  Time                       399552 non-null  object
dtypes: float64(1), int32(2), int64(3), object(6)
memory usage: 33.5+ MB
Explanatory Analysis
Let's compute the total transaction price of each row by multiplying Ordered_Quantity by the unit price.
# Computing Sales Column
cleaned_transactional_data["Total_Price_in_Sterling"] = cleaned_transactional_data["Ordered_Quantity"] * cleaned_transactional_data["Product_Price_in_Sterling"]
cleaned_transactional_data.sample(2)
| Invoice_Number | Product_Code | Product | Ordered_Quantity | Product_Price_in_Sterling | Customer_ID_Number | Customer_Region | Date | Day | Month | Year | Time | Total_Price_in_Sterling | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 76942 | 499897 | 84510B | SET OF 4 FAIRY CAKES COASTERS | 20 | 1.06 | 14298 | United Kingdom | 2010-03-03 | 3 | MAR | 2010 | 11:53:00 | 21.2 |
| 77807 | 499973 | 21874 | GIN AND TONIC MUG | 2 | 1.25 | 13595 | United Kingdom | 2010-03-03 | 3 | MAR | 2010 | 14:42:00 | 2.5 |
Throughout this notebook's visualizations, I will apply data visualization principles such as decluttering, Gestalt principles, pre-attentive attributes, and thinking like a designer. I will achieve this with Plotly.
1. Top 10 Products
#Grouping the product
product_grouped = cleaned_transactional_data.groupby("Product").agg({
"Ordered_Quantity" : "sum",
"Total_Price_in_Sterling" : "sum"}).reset_index()
In the above cell, I grouped the sales information by product. In the cell below, I will sort by number of units sold and by revenue generated, keeping only the top 10 of each.
top10_product_qty = product_grouped.sort_values("Ordered_Quantity",ascending=False).head(10).reset_index(drop=True)
top10_product_sales = product_grouped.sort_values("Total_Price_in_Sterling",ascending=False).head(10).reset_index(drop=True)
To achieve a better visualization, I pre-define the colours I am going to use, the label format, and other required settings.
y_axis_position = list(range(len(top10_product_qty)))
colors_using_qty = ["#D72638"] + ["#D3D3D3"] * (len(top10_product_qty) - 1)
colors_using_sales = ["#596db9"] + ["#D3D3D3"] * (len(top10_product_sales) - 1)
bar_width = [1] + [0.5] * (len(top10_product_qty) - 1)
# Wrap long product names onto multiple lines for the labels
def text_wrapping(label, width=25):
    return "<br>".join(textwrap.wrap(label, width))
top10_product_qty["name"] = top10_product_qty["Product"].apply(text_wrapping)
top10_product_sales["name"] = top10_product_sales["Product"].apply(text_wrapping)
My main charting code starts here. I begin by defining an empty figure, then add horizontal bar traces on both sides with the respective data, along with the settings and annotations needed to follow the data visualization principles.
butterfly_chart = pltygo.Figure()
butterfly_chart.add_trace(pltygo.Bar(
x=-top10_product_qty["Ordered_Quantity"],
y=y_axis_position,
width=bar_width,
orientation="h",
marker_color=colors_using_qty,
text=top10_product_qty["Ordered_Quantity"],
textposition="inside",
textfont=dict(size=12,weight="bold"),
cliponaxis=False,
hovertemplate="%{text} units<br>Product: %{customdata}<extra></extra>",
customdata=top10_product_qty["Product"]
))
butterfly_chart.add_trace(pltygo.Bar(
x=top10_product_sales["Total_Price_in_Sterling"],
y=y_axis_position,
width=bar_width,
orientation="h",
marker_color=colors_using_sales,
text=top10_product_sales["Total_Price_in_Sterling"].apply(lambda x: f"£{x:,.0f}"),
textposition="inside",
textfont=dict(size=12,weight="bold"),
cliponaxis=False,
hovertemplate="%{text}<br>Product: %{customdata}<extra></extra>",
customdata=top10_product_sales["Product"]
))
butterfly_chart.update_layout(
title_text="Top 10 Products",
title_x=0.5,
title_y=0.95,
title_font=dict(size=22, family="Arial Black", color="black"),
font=dict(family="Arial", size=13),
width=1000,
height=700,
bargap=0.2,
margin=dict(l=20, r=20, t=80, b=20),
showlegend=False,
barmode="relative",
yaxis=dict(
tickmode="array",
tickvals=y_axis_position,
ticktext=[""] * len(y_axis_position),
autorange="reversed",
showticklabels=False,
showline=False,
showgrid=False,
zeroline=False
),
xaxis=dict(
showticklabels=False,
zeroline=True,
zerolinecolor="grey",
showline=False,
showgrid=False,
),
plot_bgcolor="rgba(0,0,0,0)",
paper_bgcolor="rgba(0,0,0,0)",
transition=dict(
duration=500,
easing="cubic-in-out",
),
)
butterfly_chart.add_annotation(
x=-1000*35,
y=0,
text="Top product sold by Units ",
showarrow=True,
arrowhead=2,
ax=10,
ay=-50,
font=dict(size=14, color="red")
)
butterfly_chart.add_annotation(
x=1000*80,
y=0,
text="Top product with more Sales ",
showarrow=True,
arrowhead=2,
ax=10,
ay=-50,
font=dict(size=14, color="Blue")
)
for i, label in enumerate(top10_product_sales["name"]):
color = colors_using_sales[0] if i == 0 else "black"
text_value = label[0] if isinstance(label, list) else str(label)
butterfly_chart.add_annotation(
y=y_axis_position[i],
x=1000*155,
text= text_value,
showarrow=False,
font=dict(size=11, weight="bold" if i == 0 else "normal",color=color),
xanchor="left",
align="left",
)
for i, label in enumerate(top10_product_qty["name"]):
color = colors_using_qty[0] if i == 0 else "black"
text_value = label[0] if isinstance(label, list) else str(label)
butterfly_chart.add_annotation(
y=y_axis_position[i],
x=-1000*60,
text=text_value,
showarrow=False,
font=dict(size=11, weight="bold" if i == 0 else "normal",color=color),
xanchor="right",
align="right",
)
butterfly_chart.show()
Observation: From the butterfly chart
- The product WHITE HANGING HEART T-LIGHT HOLDER is the best performer on both criteria, with 56,81 units sold and £151,339 of revenue generated.
- On the other side, WORLD WAR 2 GLIDERS has the second-highest number of units sold but does not appear in the top 10 by revenue.
Selling more units does not always generate more revenue; the business may need to revisit its pricing strategy.
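That pricing point can be checked by computing revenue per unit; a toy sketch with made-up figures (in the notebook this would run on product_grouped):

```python
import pandas as pd

# Made-up totals for three products (illustrative only)
product_grouped = pd.DataFrame({
    "Product": ["T-LIGHT HOLDER", "WW2 GLIDER", "CAKESTAND"],
    "Ordered_Quantity": [5000, 4800, 900],
    "Total_Price_in_Sterling": [14750.0, 1392.0, 11475.0],
})
product_grouped["Revenue_per_Unit"] = (
    product_grouped["Total_Price_in_Sterling"] / product_grouped["Ordered_Quantity"]
)
# A high-volume, low-price product can earn far less per unit
print(product_grouped.sort_values("Revenue_per_Unit", ascending=False))
```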
2. Sales by Country
country_grouped_sales = cleaned_transactional_data.groupby("Customer_Region")["Total_Price_in_Sterling"].sum().reset_index()
In the above code, I grouped the transactions by country and summed the revenue. Now I will rank the countries and highlight the top five.
country_grouped_sales["rank"] = country_grouped_sales["Total_Price_in_Sterling"].rank(method="first", ascending=False).astype(int)
country_grouped_sales["colour"] = country_grouped_sales["rank"].apply(
lambda r: "#D72638" if r == 1 else "#596db9" if r <= 5 else "#D3D3D3"
)
country_grouped_sales["Customer_Region"] = country_grouped_sales["Customer_Region"].replace({
"EIRE": "Ireland"
})
country_grouped_sales["label"] = (
"£" + country_grouped_sales["Total_Price_in_Sterling"].round(2).astype(str)
)
top_2_to_5 = country_grouped_sales[
(country_grouped_sales["rank"] >= 2) & (country_grouped_sales["rank"] <= 5)
].sort_values(by="rank").reset_index(drop=True)
corner_positions = [
(0.01, 0.99), # Top Left
(0.99, 0.99), # Top Right
(0.01, 0.01), # Bottom Left
(0.99, 0.01) # Bottom Right
]
As before, I pre-define the colours, label formats, and other required settings, then build the choropleth figure and add the annotations needed to follow the data visualization principles.
map_plot = pltyx.choropleth(
country_grouped_sales,
locations="Customer_Region",
locationmode="country names",
color="colour",
color_discrete_map={
"#D3D3D3": "#D3D3D3",
"#596db9": "#596db9",
"#D72638": "#D72638"
},
custom_data=["Customer_Region", "label"],
)
map_plot.update_layout(
title_text="Top Revenue-Generating Countries",
title_x=0.5,
title_y=0.95,
title_font=dict(size=22, family="Arial Black", color="black"),
font=dict(family="Arial", size=13),
width=1050,
height=650,
dragmode=False,
showlegend=False,
geo=dict(
projection_scale=1,
center=dict(lat=20, lon=0),
showland=True,
landcolor="#126e0c",
showframe=False,
showcoastlines=False,
showcountries=True,
countrycolor="white",
showocean=True,
oceancolor="white",
lakecolor="white",
lonaxis=dict(showgrid=False),
lataxis=dict(showgrid=False),
),
template="plotly_white"
)
map_plot.update_traces(
hovertemplate="<b>%{customdata[0]}</b><br>Revenue: %{customdata[1]}<extra></extra>"
)
map_plot.add_annotation(
text="<b>United Kingdom</b><br>" +
"£151,339<br>" +
"Revenue Leader<br>" +
"● ● ● ● ● ● ○",
x=0.5,
y=0.845,
font=dict(size=20, color="red"),
xref="paper",
yref="paper"
)
for i, row in top_2_to_5.iterrows():
x, y = corner_positions[i]
# Annotation content
text = (
f"<b>{row['Customer_Region']}</b><br>"
f"£{row['Total_Price_in_Sterling']:,.0f}<br>"
f"Rank: {int(row['rank'])}"
)
# Add annotation to map
map_plot.add_annotation(
x=x,
y=y,
xref="paper",
yref="paper",
showarrow=False,
align="left" if x < 0.5 else "right",
text=text,
font=dict(size=15, color="#596db9"),
bgcolor="white",
opacity=0.95
)
map_plot.show()
Observation: From the geo plot
- Since the company operates in the UK, UK customers appear to be the largest segment, with £151,339 in revenue.
- The real financial heavy lifters abroad are Ireland, the Netherlands, France, and Germany.
- The company underperforms in Asian and Middle-Eastern regions, presenting an opportunity to diversify its markets.
- It depends mainly on the European market, which can be risky.
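The concentration risk can be made concrete as each country's share of total revenue; a toy sketch with made-up figures (in the notebook this would run on country_grouped_sales):

```python
import pandas as pd

# Made-up revenue totals (illustrative only)
country_sales = pd.Series({
    "United Kingdom": 900_000.0,
    "EIRE": 40_000.0,
    "Germany": 30_000.0,
    "France": 20_000.0,
    "Japan": 10_000.0,
})
revenue_share = (country_sales / country_sales.sum() * 100).round(1)
print(revenue_share.sort_values(ascending=False))
```

A single country holding the large majority of revenue is exactly the dependency risk noted above.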
3. Monthly Sales Distribution
sales_data = cleaned_transactional_data.groupby(["Year", "Month"])["Total_Price_in_Sterling"].sum().reset_index()
order_month = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
"JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
month_to_num = {
"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
"JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12
}
sales_data["Month"] = pnds.Categorical(sales_data["Month"], categories=order_month, ordered=True)
sales_data["Month_Num"] = sales_data["Month"].map(month_to_num)
sales_data["Month_Year"] = pnds.to_datetime(
sales_data["Year"].astype(str) + "-" + sales_data["Month_Num"].astype(str) + "-01"
)
sales_data["label"] = sales_data["Month"].astype(str) + " " + sales_data["Year"].astype(str)
sales_data = sales_data.sort_values("Month_Year").reset_index(drop=True)
sales_data
| Year | Month | Total_Price_in_Sterling | Month_Num | Month_Year | label | |
|---|---|---|---|---|---|---|
| 0 | 2009 | DEC | 677916.07 | 12 | 2009-12-01 | DEC 2009 |
| 1 | 2010 | JAN | 533712.98 | 1 | 2010-01-01 | JAN 2010 |
| 2 | 2010 | FEB | 497937.57 | 2 | 2010-02-01 | FEB 2010 |
| 3 | 2010 | MAR | 665973.67 | 3 | 2010-03-01 | MAR 2010 |
| 4 | 2010 | APR | 585127.43 | 4 | 2010-04-01 | APR 2010 |
| 5 | 2010 | MAY | 592734.13 | 5 | 2010-05-01 | MAY 2010 |
| 6 | 2010 | JUN | 628809.40 | 6 | 2010-06-01 | JUN 2010 |
| 7 | 2010 | JUL | 581487.96 | 7 | 2010-07-01 | JUL 2010 |
| 8 | 2010 | AUG | 594561.17 | 8 | 2010-08-01 | AUG 2010 |
| 9 | 2010 | SEP | 805544.81 | 9 | 2010-09-01 | SEP 2010 |
| 10 | 2010 | OCT | 1011556.75 | 10 | 2010-10-01 | OCT 2010 |
| 11 | 2010 | NOV | 1155978.39 | 11 | 2010-11-01 | NOV 2010 |
| 12 | 2010 | DEC | 308302.12 | 12 | 2010-12-01 | DEC 2010 |
In the above cell, I grouped the transactional data by year and month to compute the total revenue for each period. I then defined the correct month order so the chart plots correctly, converted the months to numbers, and sorted the sales data chronologically. Now I will find the minimum and maximum sales to highlight in the chart.
max_row = sales_data.loc[sales_data["Total_Price_in_Sterling"].idxmax()]
min_row = sales_data.loc[sales_data["Total_Price_in_Sterling"].idxmin()]
Now I will create a line chart showing sales over time, highlight the maximum and minimum sales points, and draw an average-sales line, adding the settings and annotations needed to follow the data visualization principles.
line_chart = pltygo.Figure()
line_chart.add_trace(pltygo.Scatter(
x=sales_data["label"],
y=sales_data["Total_Price_in_Sterling"],
mode="lines",
name="Sales",
line=dict(color="royalblue", width=3),
hovertemplate="Month: %{x}<br>Sales: £%{y:,.0f}<extra></extra>",
))
line_chart.add_trace(pltygo.Scatter(
x=[max_row["label"]],
y=[max_row["Total_Price_in_Sterling"]],
mode="text",
text=[f"High: £{max_row['Total_Price_in_Sterling'] / 1e3:.1f}K"],
textposition="top center",
textfont=dict(color="green", size=14, family="Arial Black"),
showlegend=False
))
line_chart.add_trace(pltygo.Scatter(
x=[min_row["label"]],
y=[min_row["Total_Price_in_Sterling"]],
mode="text",
text=[f"Low: £{min_row['Total_Price_in_Sterling'] / 1e3:.1f}K"],
textposition="middle left",
textfont=dict(color="red", size=14, family="Arial Black"),
showlegend=False
))
avg_val = sales_data["Total_Price_in_Sterling"].mean()
line_chart.add_hline(
y=avg_val,
line_dash="dot",
line_color="gray",
annotation_text=f"Avg: £{avg_val/1e3:.1f}K ",
annotation_position="bottom right",
annotation_font=dict(size=11, color="gray")
)
line_chart.update_layout(
title_text="Monthly Sales Trend",
title_x=0.5,
title_y=0.95,
title_font=dict(size=22, family="Arial Black", color="black"),
margin=dict(t=100, b=50, l=50, r=10),
plot_bgcolor="white",
hovermode="closest",
showlegend=False,
width=1050,
height=650,
xaxis=dict(
showgrid=True,
showline=False,
zeroline=False,
tickmode="array",
tickvals=sales_data["label"].tolist(),
tickangle=-30,
tickfont=dict(size=13),
gridcolor="lightgray",
gridwidth=0.5
),
yaxis=dict(
showgrid=False,
showline=False,
zeroline=False,
tickfont=dict(size=13),
)
)
line_chart.add_annotation(
text="Data from 2009–2010 | Values in £",
x=0.5, y=1.08,
xref="paper", yref="paper",
showarrow=False,
font=dict(size=12, color="gray")
)
line_chart.add_vrect(
x0="SEP 2010", x1="DEC 2010",
fillcolor="lightblue", opacity=0.2,
layer="below", line_width=0
)
line_chart.add_vrect(
x0="DEC 2009", x1="MAR 2010",
fillcolor="#FFB3B3", opacity=0.2,
layer="below", line_width=0
)
line_chart.show()
Observation: From Line Chart
- The company's highest revenue came in NOV 2010, likely driven by seasonal events such as Black Friday and Christmas. The pattern appears similar to the previous year, but I cannot say that with confidence, as I do not have enough data to prove it.
- The company underperformed during JAN–MAR 2010 (shaded in red), indicating a post-holiday slump.
4. Peak Sales Hours
As the saying goes, everything has its own time, and that holds in business as well: no business performs uniformly throughout the day. The peak performing period is an important insight for making informed marketing and operational decisions, such as scheduling ad campaigns and sending marketing emails.
cleaned_transactional_data["Time"]= pnds.to_datetime(cleaned_transactional_data["Time"],format="%H:%M:%S")
cleaned_transactional_data["Hour"]= cleaned_transactional_data["Time"].dt.hour
Sales_hourly = cleaned_transactional_data.groupby("Hour")["Total_Price_in_Sterling"].sum().reset_index()
Sales_hourly = Sales_hourly.sort_values("Hour")
I first converted the Time column to a pandas datetime type, extracted the hour, and then grouped sales by hour. In the cell below, I calculate the mapping from hours to angles (since I am using a radial chart) and bin the data into groups such as morning, afternoon, evening, and night.
Sales_hourly["hours_to_angle"] = Sales_hourly["Hour"] * 15
hour_bins = [
{"name": "Night A", "start": 315, "end": 360, "color": "rgba(255, 200, 200, 0.15)"},
{"name": "Night B", "start": 0, "end": 90, "color": "rgba(255, 200, 200, 0.15)"},
{"name": "Morning", "start": 90, "end": 180, "color": "rgba(255, 255, 200, 0.15)"},
{"name": "Afternoon", "start": 180, "end": 255, "color": "rgba(200, 255, 200, 0.15)"},
{"name": "Evening", "start": 255, "end": 315, "color": "rgba(200, 200, 255, 0.15)"}
]
tick_labels = [0, 3, 6, 9, 12, 15, 18, 21]
labels_zone = {
"Night": (345, 1.15),
"Morning": (75, 1.15),
"Afternoon": (195, 1.15),
"Evening": (270, 1.15)
}
Sales_hourly["custom_hour"] = Sales_hourly["Hour"].apply(lambda h: f"{h:02d}:00")
peak_hour = Sales_hourly.loc[Sales_hourly["Total_Price_in_Sterling"].idxmax()]
bar_colors = ["gray"] * len(Sales_hourly)
peak_start = peak_hour["Hour"] * 15
peak_end = peak_start + 15
for i, angle in enumerate(Sales_hourly["hours_to_angle"]):
if peak_start <= angle < peak_end:
bar_colors[i] = "#596db9"
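As a sanity check on the mapping above: 24 hours spread over 360° means each hourly bar occupies a 15° wedge, and the peak bar is the one whose angle falls inside the peak hour's wedge. A standalone sketch of that logic:

```python
# Each of the 24 hours maps to a 15-degree wedge (24 * 15 = 360)
def hour_to_angle(hour: int) -> int:
    return hour * 15

assert hour_to_angle(0) == 0      # midnight at 0 degrees (top, with rotation=90)
assert hour_to_angle(12) == 180   # noon directly opposite midnight
assert hour_to_angle(23) == 345   # last wedge ends just short of 360

# The peak-highlight test used above: does an angle fall in the peak wedge?
peak_start, peak_end = hour_to_angle(12), hour_to_angle(12) + 15
assert peak_start <= 180 < peak_end        # noon's own angle is inside
assert not (peak_start <= 195 < peak_end)  # 13:00 is not
```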
To achieve a better visualization, I pre-define the colours, label formats, and other required settings. The main charting code starts next: I define an empty figure, then add the traces with their respective data, settings, and annotations to adhere to data visualization principles.
radial_chart = pltygo.Figure()
radial_chart.add_trace(pltygo.Barpolar(
r=Sales_hourly["Total_Price_in_Sterling"],
theta=Sales_hourly["hours_to_angle"],
width=[15] * len(Sales_hourly),
marker=dict(
color=bar_colors,
opacity=0.9,
line=dict(color="white", width=0.5)
),
customdata=Sales_hourly["custom_hour"],
hovertemplate="<b>%{customdata}</b><br>Sales: %{r:,.0f}<extra></extra>"
))
radial_chart.update_layout(
title_text="Peak Sales Hour Information",
title_x=0.57,
title_y=0.92,
title_font=dict(size=24, family="Arial Black", color="black"),
font=dict(family="Arial", size=13),
polar=dict(
bgcolor="white",
angularaxis=dict(
direction="clockwise",
rotation=90,
tickmode="array",
tickvals=[i * 15 for i in tick_labels],
ticktext=[f"{i}:00" for i in tick_labels],
tickfont=dict(size=12, color="gray"),
showline=False,
showgrid=True,
gridcolor="lightgray",
gridwidth=0.3
),
radialaxis=dict(visible=False)
),
paper_bgcolor="white",
plot_bgcolor="white",
margin=dict(l=200, r=80, t=100, b=60),
width=1050,
height=650,
showlegend=False
)
radial_chart.add_annotation(
text=f"<b>Peak Hour: {peak_hour['Hour']}:00 PM</b>",
x=0.5, y=0.55,
xref="paper", yref="paper",
showarrow=False,
font=dict(size=16, color="red")
)
radial_chart.add_annotation(
text="<b>Prime Activity Hours<br>From 10 AM To 1 PM</b>",
x=-0.25, y=0.5,
xref="paper", yref="paper",
showarrow=False,
font=dict(size=20, color="#596db9")
)
max_radius = Sales_hourly["Total_Price_in_Sterling"].max() * 1.05
for zone in hour_bins:
start = zone["start"]
end = zone["end"]
angle_range = list(range(start, end if end > start else 360 + end))
radial_chart.add_trace(pltygo.Scatterpolar(
r=[max_radius] * len(angle_range),
theta=angle_range,
mode="lines",
fill="toself",
fillcolor=zone["color"],
line=dict(color="rgba(0,0,0,0)"),
hoverinfo="skip",
showlegend=False
))
placed_labels = set()
for zone in hour_bins:
start = zone["start"]
end = zone["end"]
label = zone["name"].replace(" A", "").replace(" B", "")
if label in placed_labels:
continue
placed_labels.add(label)
mid_angle = (90 - ((start + (end - start) / 2) % 360)) % 360
is_bold = label in ["Morning", "Afternoon"]
radius = 0.6 if is_bold else 0.45
x = 0.5 + radius * nmpy.cos(nmpy.radians(mid_angle))
y = 0.5 + radius * nmpy.sin(nmpy.radians(mid_angle))
radial_chart.add_annotation(
text=f"<b>{label}</b>" if is_bold else label,
x=x,
y=y,
xref="paper", yref="paper",
showarrow=False,
font=dict(size=15 if is_bold else 13, color="gray"),
align="center"
)
radial_chart.show()
Observation: From Radial Chart
- The company's peak sales activity happened at 12 PM, indicating the highest customer engagement around midday, during lunch or break hours.
- The prime activity period runs from 10 AM to 1 PM, making it a good window for promotional activities such as flash sales and limited editions to drive more engagement.
- This also informs staffing strategy, e.g. scheduling live support and customer care coverage during these hours.
5. Dead Stock Alert
cleaned_transactional_data["Date"] = pnds.to_datetime(cleaned_transactional_data["Date"])
last_transactional_date = cleaned_transactional_data["Date"].max()
date_3_months_back = last_transactional_date - pnds.DateOffset(months=3)
sales_last_3_months = cleaned_transactional_data[cleaned_transactional_data["Date"] >= date_3_months_back].groupby("Product").agg({
"Ordered_Quantity": "sum",
"Total_Price_in_Sterling": "sum"
}).reset_index()
overal_data = cleaned_transactional_data.groupby("Product").agg({
"Ordered_Quantity": "sum",
"Total_Price_in_Sterling": "sum"
}).reset_index()
transactions_of_dead_stock = overal_data[~overal_data["Product"].isin(sales_last_3_months["Product"])]
transactions_of_dead_stock = transactions_of_dead_stock.nlargest(15, "Ordered_Quantity")
transactions_of_dead_stock = transactions_of_dead_stock.sort_values(by="Ordered_Quantity", ascending=False).reset_index(drop=True)
def wrap_label(text, width=18):
return "<br>".join(textwrap.wrap(text, width=width))
transactions_of_dead_stock["label"] = transactions_of_dead_stock["Product"].apply(lambda x: wrap_label(x))
colour = ["gold", "silver", "#cd7f32"] + ["#E74C3C"] * (len(transactions_of_dead_stock) - 3)
lolipop_heigh = transactions_of_dead_stock["Ordered_Quantity"] * 0.9
First, I identified the date three months before the last transaction and filtered the data on quantity and sales over that period. I then tallied this against each product's overall quantity and sales and identified the 15 highest-volume products with no recent sales, which I treat as dead stock in this analysis.
As before, I pre-define the colours, label formats, and other required settings, then build the figure by adding traces with their respective data, settings, and annotations to adhere to data visualization principles.
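As a quick illustration of what `wrap_label` produces (the product name below is purely illustrative):

```python
import textwrap

# Same helper as defined above: break long product names with <br>
# so Plotly renders them across multiple lines on the x-axis.
def wrap_label(text, width=18):
    return "<br>".join(textwrap.wrap(text, width=width))

# Hypothetical product name, used only to demonstrate the wrapping
print(wrap_label("BLACK AND WHITE PAISLEY FLOWER MUG"))
# BLACK AND WHITE<br>PAISLEY FLOWER MUG
```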
lolipop_chart = pltygo.Figure()
lolipop_chart.add_trace(pltygo.Bar(
x=transactions_of_dead_stock["label"],
y=lolipop_heigh,
marker_color="rgba(180,180,180,0.3)",
width=0.25,
hoverinfo="skip"
))
lolipop_chart.add_trace(pltygo.Scatter(
x=transactions_of_dead_stock["label"],
y=transactions_of_dead_stock["Ordered_Quantity"],
mode="markers",
marker=dict(
color=colour,
size=14,
line=dict(color="white", width=2)
),
customdata=transactions_of_dead_stock["Total_Price_in_Sterling"],
hovertemplate="<b>%{x}</b><br>Qty: %{y:,}<br>Sales: £%{customdata:,.2f}<extra></extra>"
))
lolipop_chart.add_annotation(
text="Highest Overstock",
x=transactions_of_dead_stock["label"][0],
y=transactions_of_dead_stock["Ordered_Quantity"][0] + 2000,
showarrow=True,
arrowhead=2,
arrowcolor="crimson",
font=dict(size=13, color="crimson"),
ax=0,
ay=-40
)
lolipop_chart.update_layout(
title=dict(
text="Dead Stock Products from last 3 Months",
x=0.5,
font=dict(size=24, family="Arial Black")
),
width=1050,
height=600,
bargap = 0.4,
margin=dict(l=40, r=40, t=90, b=160),
plot_bgcolor="white",
paper_bgcolor="white",
font=dict(family="Arial", size=12),
xaxis=dict(
tickangle=-35,
showline=False,
showgrid=False,
ticks="",
title="",
tickfont=dict(size=11)
),
yaxis=dict(
#title="Ordered Quantity",
showgrid=True,
gridcolor="rgba(220,220,220,0.5)",
zeroline=False,
tickfont=dict(size=11)
),
showlegend=False
)
lolipop_chart.show()
Observation: From Lollipop Chart
- The product **Black and White Paisley Flower Mug** is the biggest unsold item at roughly 30,000 units, indicating potential issues with demand for that product or a failed marketing push.
- About six other products, at 12–14k units each, are also underperforming and taking up inventory space. A change in the marketing and operational approach is definitely required.
Conclusion:
This explanatory analysis pipeline was structured to help the client understand their sales performance, including identifying dead-stock products. I have tried to answer all the key challenges mentioned above through this visual analysis and uncovered critical patterns and key observations. I used butterfly, radial, lollipop, and line charts, plus a geo plot, to achieve the objective of this assignment.
Strengths:
- The structure of the report and analysis is one of its strengths.
- The visual representation follows well-thought-out visual principles from my professor, such as decluttering.
- A cognitive design approach: thinking both as the designer and as the audience.
I believe this EDA will be valuable to the UK-based gift company in taking effective actions.
In this Analysis, the process i have followed (wherever necessary):
- Removed chart borders
- Removed gridlines
- Removed data markers where unnecessary
- Cleaned up axis labels
- Labeled data directly on charts
- Used consistent colour schemes
- Highlighted important data points
- Eliminated distractions
- Maintained accessibility
- Focused on aesthetics
- Avoided overcomplication
- Proximity
- Similarity
- Continuity
- Closure
- Figure/Ground
- Common Fate
- Preattentive Attributes
- Size
- Colour (used sparingly)
- Position on page
- Affordance (intuitive chart design)
Skills Displayed
- Data Cleaning & Preparation: Handling missing data, duplicates, and ensuring data integrity.
- Explanatory Data Analysis (EDA): Interpreting and presenting data trends and patterns to inform business decisions.
- Data Visualization: Creating clear and actionable visual insights using Plotly to communicate findings effectively.
- Business Strategy Insights: Translating data into actionable recommendations for inventory, pricing, and marketing strategies.
- Effective Communication: Presenting complex data insights in a simple, understandable format for business stakeholders.